Document matching on CCITT Group 4 compressed images

نویسنده

  • Jonathan J. Hull
چکیده

A method is proposed for detecting whether two CCITT group 4 images were scanned from the same document. Features are extracted from rectangular patches of text and compared with a modified Hausdorff distance measure. Two images are said to be ‘‘equivalent’’ (i.e., they were scanned from the same document) if the Hausdorff measure finds that a specified number of features are located within a given distance of one another in both images. This paper explains the technique and presents experimental results that demonstrate its effectiveness. It is shown that features extracted from a one-inch square patch of image data provide better than 95% correct retrieval accuracy with no false positives on a database of 800 documents.

برای دانلود متن کامل این مقاله و بیش از 32 میلیون مقاله دیگر ابتدا ثبت نام کنید

ثبت نام

اگر عضو سایت هستید لطفا وارد حساب کاربری خود شوید

منابع مشابه

Word Searching in CCITT Group 4 Compressed Document Images

In this paper, we present a compressed pattern matching method for searching user queried words in the CCITT Group 4 compressed document images, without decompressing. The feature pixels composed of black changing elements and white changing elements are extracted directly from the CCITT Group 4 compressed document images. The connected components are labeled based on a line-by-line strategy ac...

متن کامل

Keyword Searching in Compressed Document Images

A huge amount of document images are accessible in the Internet and digital libraries. We find that, most of them are packed in PDF files and are compressed using CCITT Group 4 standards for saving storage space and speeding up transmission. There is thus significant meaning to develop the methods of directly searching keywords from these documents. In this paper, we present a compressed patter...

متن کامل

Similarity measure for CCITT Group 4 compressed document images

Similarity measure of document images acts a crucial role in the area of document image retrieval. A method of measuring the similarity of CCITT Group 4 compressed document images is proposed in this paper. The features are extracted directly from the changing elements of the compressed images. Weighted Hausdorff distance is utilized to assign all of the word objects from two document images to...

متن کامل

Document retrieval from compressed images

With the emergence of digital libraries, more and more documents are stored and transmitted through the Internet in the format of compressed images. It is of signi/cant meaning to develop a system which is capable of retrieving documents from these compressed document images. Aiming at the popular compression standard-CCITT Group 4 which is widely used for compressing document images, we presen...

متن کامل

Group 4 Compressed Document Matching

Numerous approaches, including textual, structural and featural, for detecting duplicate documents have been investigated. Considering document images are usually stored and transmitted in compressed forms, it is advantageous to perform document matching directly on the compressed data. A two-stage process for matching Group 4 compressed document images is presented. In the coarse matching stag...

متن کامل

ذخیره در منابع من


  با ذخیره ی این منبع در منابع من، دسترسی به آن را برای استفاده های بعدی آسان تر کنید

برای دانلود متن کامل این مقاله و بیش از 32 میلیون مقاله دیگر ابتدا ثبت نام کنید

ثبت نام

اگر عضو سایت هستید لطفا وارد حساب کاربری خود شوید

عنوان ژورنال:

دوره   شماره 

صفحات  -

تاریخ انتشار 1997